Add initial end-to-end CUDA FGMRES solver path by LwhJesse · Pull Request #2825 · su2code/SU2

LwhJesse · 2026-06-01T14:37:52Z

Proposed Changes

This PR adds an initial end-to-end CUDA FGMRES linear solve path on top of the existing CUDA BSR SpMV path.

It intentionally bundles the minimal pieces required for a reviewable GPU linear-solve slice, rather than sending the intermediate infrastructure-only pieces separately. The scope is limited to one GPU Krylov solver path (FGMRES), one simple GPU preconditioner path (JACOBI), and the vector operations and dispatch/lifecycle changes strictly required to make that path run.

Concretely, this PR:

caches the cuSPARSE SpMV resources needed by the solver path
adds CUDA FGMRES scaffolding and internal dispatch while keeping the public solver entry point unchanged
adds the CUDA vector primitives needed by the solver path
implements an initial CUDA FGMRES solve path while sharing the FGMRES control flow with the host solver
adds a simple CUDA Jacobi preconditioner path
keeps cuSPARSE for SpMV
keeps cuBLAS for dot / norm
dispatches CSysVector expression-template vector updates to custom CUDA kernels instead of exposing a parallel solver-visible GPU vector algebra API

This PR does not attempt to add more GPU Krylov solvers, more advanced GPU preconditioners, remove the current host-driven Krylov control flow, or perform broader cache / portability / cleanup work beyond this minimal slice.

Related Work

This PR follows the review direction discussed in #2822, where the request was to show a working end-to-end GPU linear solve path before splitting out additional infrastructure work.

It also follows the implementation preferences discussed in #2816:

cuSPARSE for SpMV
cuBLAS for dot / norm
custom CUDA kernels for vector-vector operations

Suggested review order:

53bacf193f Cache CUDA SpMV cuSPARSE resources
08fde80e1e Add CUDA FGMRES and Jacobi scaffolding
fde2c145cf Implement CUDA vector primitives
2b4f9d8716 Implement CUDA FGMRES solve path
9c344ee793 Implement CUDA Jacobi preconditioner
a875f56767 Share FGMRES control flow with CUDA vector dispatch

Validation

Validated locally with:

python3.12 -m pre_commit run --all-files
serial CUDA build compilation
mixed-precision CUDA build compilation
serial CPU build compilation
OpenMP CPU build compilation
CPU/GPU numerical comparison on 6 representative cases, each tested with LINEAR_SOLVER_PREC=NONE and LINEAR_SOLVER_PREC=JACOBI
nsys profiling
ncu profiling

Representative cases used for validation:

periodic2d_sector
udf_lam_flatplate_s
udf_lam_flatplate_m
udf_lam_flatplate_l
udf_test_11_probes_s
udf_test_11_probes_m

In short: this branch compiles, the end-to-end CUDA FGMRES path runs successfully on the tested cases, and the GPU-side results are numerically consistent with the CPU-side results. Across the tested cases, the CPU and GPU residual histories either match exactly or differ only at floating-point roundoff level.
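For reviewers who want to repeat the comparison, the check described above can be sketched as a small host-side helper. This is only an illustration: the function names and the `1e-12` default tolerance are assumptions chosen for the sketch, not values taken from this PR or from SU2.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Largest relative difference between two residual histories.
// Returns infinity when the histories have different lengths.
double MaxRelDiff(const std::vector<double>& cpu, const std::vector<double>& gpu) {
  if (cpu.size() != gpu.size()) return std::numeric_limits<double>::infinity();
  double worst = 0.0;
  for (std::size_t i = 0; i < cpu.size(); ++i) {
    const double scale = std::max(std::fabs(cpu[i]), std::fabs(gpu[i]));
    if (scale == 0.0) continue;  // both entries are exactly zero
    worst = std::max(worst, std::fabs(cpu[i] - gpu[i]) / scale);
  }
  return worst;
}

// True when the two histories agree to within relTol (an assumed tolerance).
bool HistoriesAgree(const std::vector<double>& cpu, const std::vector<double>& gpu,
                    double relTol = 1e-12) {
  return MaxRelDiff(cpu, gpu) <= relTol;
}
```

A result of exactly zero from `MaxRelDiff` corresponds to the "match exactly" case; a small nonzero value corresponds to roundoff-level differences.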

Performance was also checked on the same representative cases against both a serial CPU build and a 20-thread OpenMP CPU build. The GPU path is faster than the serial CPU baseline on the medium and large cases. Against the 20-thread OpenMP baseline, it is not beneficial on the smallest cases but still shows a clear speedup on the medium and large ones.

The simple Jacobi path is numerically valid, but is not yet a net performance win on these cases.
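For context, applying a Jacobi preconditioner to a block-structured matrix amounts to z = D^{-1} r, where D is the block diagonal. The host-side sketch below assumes the inverse diagonal blocks have already been computed and are stored densely, row-major, one after another. The function name and storage layout are illustrative assumptions and do not describe the kernel added in this PR.

```cpp
#include <cstddef>
#include <vector>

// Apply z = D^{-1} r for a block-diagonal D.
// invDiag holds one dense blockSize x blockSize inverse per block row,
// stored row-major and back to back (assumed layout for this sketch).
void ApplyBlockJacobi(const std::vector<double>& invDiag, std::size_t blockSize,
                      const std::vector<double>& r, std::vector<double>& z) {
  const std::size_t nBlocks = r.size() / blockSize;
  z.assign(r.size(), 0.0);
  for (std::size_t b = 0; b < nBlocks; ++b) {
    // Each block row is independent, so this loop is what a GPU version
    // would spread across threads.
    const double* block = &invDiag[b * blockSize * blockSize];
    for (std::size_t i = 0; i < blockSize; ++i) {
      double sum = 0.0;
      for (std::size_t j = 0; j < blockSize; ++j) {
        sum += block[i * blockSize + j] * r[b * blockSize + j];
      }
      z[b * blockSize + i] = sum;
    }
  }
}
```

Each block row does very little arithmetic per value it reads from memory.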

PR Checklist

I am submitting my contribution to the develop branch.
My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
I have added a test case that demonstrates my contribution, if necessary.
I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

Move the FGMRES iteration into one shared implementation and select host or CUDA vector-operation backends from the existing solver entry point. Keep the CUDA path GPU-resident for SpMV, Jacobi, dot/norm, and vector updates, with only scalar reductions and final solution synchronization crossing back to the host.
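The idea of one shared control flow with swappable vector backends can be sketched with the orthogonalization step that sits inside a GMRES-type iteration. Everything named here (`DenseVec`, `HostOps`, `ModifiedGramSchmidt`) is hypothetical and host-only. A CUDA backend would expose the same three operations but forward them to cuBLAS and custom kernels, which this sketch does not attempt.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using DenseVec = std::vector<double>;

// Hypothetical host backend: plain loops over std::vector.
struct HostOps {
  static double Dot(const DenseVec& a, const DenseVec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
  }
  static void Axpy(double alpha, const DenseVec& x, DenseVec& y) {  // y += alpha * x
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
  }
  static void Scale(double alpha, DenseVec& x) {  // x *= alpha
    for (double& v : x) v *= alpha;
  }
};

// Backend-agnostic modified Gram-Schmidt step: orthogonalize w against the
// existing basis, normalize it, and return the new Hessenberg column
// (projections first, norm of the remainder last).
template <class Ops>
std::vector<double> ModifiedGramSchmidt(const std::vector<DenseVec>& basis, DenseVec& w) {
  std::vector<double> h(basis.size() + 1, 0.0);
  for (std::size_t j = 0; j < basis.size(); ++j) {
    h[j] = Ops::Dot(basis[j], w);   // only this scalar is needed by the control flow
    Ops::Axpy(-h[j], basis[j], w);  // the vector update stays inside the backend
  }
  const double nrm = std::sqrt(Ops::Dot(w, w));
  h.back() = nrm;
  if (nrm > 0.0) Ops::Scale(1.0 / nrm, w);
  return h;
}
```

The control flow only ever sees scalars (`h[j]`, `nrm`), which mirrors the statement above that only scalar reductions cross back to the host.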

pcarruscag · 2026-06-21T02:00:39Z

+namespace VecExpr {
+
+enum class DeviceAssignOp { Assign, Add, Subtract, Multiply, Divide };
+
+template <class Scalar>
+class CDeviceVectorView : public CVecExpr<CDeviceVectorView<Scalar>, Scalar> {


This is a step in the right direction, but what I have in mind is to use CSysVector directly, so that CSysSolve can stay nearly identical and completely agnostic to CPU or GPU.
I think we can do this by specializing the store_t trait for CSysVector, so that the type stored by expressions becomes this view. Note that we only need to capture the pointer; the size is defined by the vector on the left-hand side of the assignment or compound assignment.
This way, CSysVector either runs the CPU loop or launches the GPU kernel in the assignment, depending on how the GPU boolean is set.

For clarity, CSysVector stops being stored as a reference and is instead stored "as a view" (which is a value type).
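A minimal, host-only sketch of the trait idea described in this comment follows. The names `Expr`, `Vec`, `VecView`, and `AddExpr` are stand-ins invented for illustration, and this `store_t` is a simplified version written for the sketch; none of it is copied from SU2's expression-template code, whose actual trait may look different.

```cpp
#include <cstddef>
#include <vector>

// CRTP base for all expression nodes (hypothetical stand-in).
template <class Derived>
struct Expr {
  const Derived& self() const { return static_cast<const Derived&>(*this); }
};

// Value-type view: captures only the data pointer, not the size.
struct VecView : Expr<VecView> {
  const double* ptr;
  explicit VecView(const double* p) : ptr(p) {}
  double operator[](std::size_t i) const { return ptr[i]; }
};

class Vec;

// Trait that decides how an operand is stored inside an expression node.
template <class T>
struct store_t { using type = T; };             // sub-expressions: by value
template <>
struct store_t<Vec> { using type = VecView; };  // vectors: as a view

template <class L, class R>
struct AddExpr : Expr<AddExpr<L, R>> {
  typename store_t<L>::type l;
  typename store_t<R>::type r;
  AddExpr(const L& a, const R& b) : l(a), r(b) {}
  double operator[](std::size_t i) const { return l[i] + r[i]; }
};

class Vec : public Expr<Vec> {
  std::vector<double> data_;

 public:
  explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
  std::size_t size() const { return data_.size(); }
  double operator[](std::size_t i) const { return data_[i]; }
  operator VecView() const { return VecView(data_.data()); }

  template <class E>
  Vec& operator=(const Expr<E>& e) {
    const E& expr = e.self();
    // The left-hand side defines the size. A GPU-enabled vector would branch
    // here and launch a kernel evaluating expr[i], instead of this CPU loop.
    for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = expr[i];
    return *this;
  }
};

template <class L, class R>
AddExpr<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
  return AddExpr<L, R>(a.self(), b.self());
}
```

With this in place, `out = a + b + c;` builds a tree whose leaves are `VecView` values rather than references to `Vec`. Because the tree holds only pointers and nested value types, the whole expression object is trivially copyable, which is the property that would let it be passed by value to a kernel.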

LwhJesse added 5 commits June 1, 2026 01:48

Cache CUDA SpMV cuSPARSE resources

53bacf1

Add CUDA FGMRES and Jacobi scaffolding

08fde80

Implement CUDA vector primitives

fde2c14

Implement CUDA FGMRES solve path

2b4f9d8

Implement CUDA Jacobi preconditioner

9c344ee

LwhJesse marked this pull request as ready for review June 3, 2026 05:20

pcarruscag reviewed Jun 13, 2026


Comment thread Common/src/linear_algebra/CSysSolveGPU.cu Outdated

LwhJesse force-pushed the gpu/initial-cuda-fgmres branch from 8982fcb to d821ca0 Compare June 17, 2026 14:24

github-advanced-security AI found potential problems Jun 17, 2026


Comment thread Common/include/linear_algebra/CSysSolveFGMRES.inl Fixed

LwhJesse force-pushed the gpu/initial-cuda-fgmres branch from 592b302 to 5b01f21 Compare June 18, 2026 05:59

LwhJesse force-pushed the gpu/initial-cuda-fgmres branch from 5b01f21 to a875f56 Compare June 18, 2026 11:37

pcarruscag reviewed Jun 21, 2026

LwhJesse wants to merge 6 commits into su2code:develop from LwhJesse:gpu/initial-cuda-fgmres


3 participants
